Speed up parsejson 3.25x (with a gcc-10.2 PGO build) on number heavy input #16055
Conversation
Also add a couple convenience iterators and update `changelog.md`.

The basic idea here is to process the ASCII just once in a combined validation/classification/parse rather than validating/classifying a number in `parsejson.parseNumber`, then doing the same work again in `parseFloat`/`parseInt`, and finally doing it a 3rd time in the C library `strtod`/etc. The last is especially slow on some libcs/common CPUs due to `long double`/80-bit extended-precision arithmetic on each digit. In any event, the new fully functional `parseNumber` is not actually much more code than the old classifying-only `parseNumber`.

Because we aim to do this work just once and the output of the work is an int64|float, this PR adds those exported fields to `JsonParser`. (It also documents two existing but undocumented fields.)

One simple optimization done here is pow10. It uses a small lookup table for the power-of-10 multiplier. The full 2048*8B = 16 KiB table is bigger than is useful. In truth, most numbers in any given run will likely be of similar orders of magnitude, meaning the cache cost would not be heavy, but it is probably best not to rely on likelihood alone. So we fall back to a fast integer-exponentiation algorithm when the exponent is out of the table's range.

The new `parseNumber` itself is more or less a straight-line parse of scientific notation where the '.' can be anywhere in the number. To do the right power-of-10 scaling later, we need to bookkeep a little more than the old `parseNumber`, as noted in side comments in the code.

Note that calling code that always wants a `float` even if the textual representation looks like an integer can do something like `if p.tok == tkInt: float(p.i) else: p.f` or similar, depending on context.
Note that since we form the final float as integer * powerOf10, since powers of 10 have some intrinsic binary representational error, and since some alternative approaches (on just some CPUs) use 80-bit floats, it is possible for the number to mismatch those other approaches in the least significant mantissa bit (or for the ASCII->binary->ASCII round trip to not be the identity). On those CPUs only, better-matching results might be obtained from an `emit` using `long double` if desired (also with a `long double` table for powers of 10 and the powers-of-10 calculation). This strikes me as unlikely to be truly needed long-term, though.
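The one-pass idea described above can be sketched roughly as follows. This is a minimal, hypothetical `parseNum` helper operating on a plain string, not the PR's actual `parseNumber` (which works on the parser's buffer and handles more cases such as leading `+`, NaN/Inf, and error reporting):

```nim
import std/math  # for pow

# Minimal one-pass sketch (hypothetical helper, not the PR's real code):
# accumulate digits into an int64 while bookkeeping where the '.' was,
# then scale once by a net power of 10 at the end.
proc parseNum(s: string): tuple[isInt: bool, i: int64, f: float] =
  var k = 0
  var neg = false
  if k < s.len and s[k] == '-':
    neg = true
    inc k
  var mant: int64 = 0
  var fracDigits = 0              # digits seen after the '.'
  var sawDot = false
  while k < s.len and (s[k] in {'0'..'9'} or s[k] == '.'):
    if s[k] == '.':
      sawDot = true
    else:
      mant = mant * 10 + int64(ord(s[k]) - ord('0'))
      if sawDot: inc fracDigits
    inc k
  var e = 0                       # optional exponent part
  if k < s.len and s[k] in {'e', 'E'}:
    inc k
    var eneg = false
    if k < s.len and s[k] in {'+', '-'}:
      eneg = s[k] == '-'
      inc k
    while k < s.len and s[k] in {'0'..'9'}:
      e = e * 10 + ord(s[k]) - ord('0')
      inc k
    if eneg: e = -e
  if neg: mant = -mant
  if not sawDot and e == 0:
    result = (true, mant, 0.0)    # pure integer: a tkInt-style result
  else:                           # scale once by the net power of ten
    result = (false, 0'i64, float(mant) * pow(10.0, float(e - fracDigits)))
```

The key point is that classification (int vs. float), validation, and value construction all happen in the same left-to-right scan, so no text is re-read by `parseFloat`/`strtod` afterwards.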
I won't be able to track down what is going wrong in the tests until at least tomorrow, and maybe not until Monday, if anyone is in a rush and wants to help. I did test it on parsing a whole big file. Also, here is a link back to the whole, long forum thread that inspired this.
Also, instead of slice assignment we could probably …

… slice assignment. Within 1.03x of non-copy mode in a PGO build (1.10x if the caller must save the last valid string). One might think this argues for ditching `strFloats` & `strIntegers` entirely and just always copying. In some sense, it does. However, the non-copy mode is a useful common case here, as shown in the `jsonTokens(string)` example code. So, it is a bit of a judgement call whether supporting that last 10% of perf in a useful common case matters.
Does anyone know why I am getting "undeclared identifier: 'copyMem'" errors? That seems weird.

Never mind. I guess NimScript does not support `copyMem`.
(Someone should make `copyMem` just work everywhere, IMO.)
Also, extract mess into a template.
This is only half related to this PR, but in the …

I have a question. The very end of …

Never mind. I am reading the long discussion. { A lot of "never mind"s in this one. That'll teach me to code before coffee. ;-) }
```nim
      e = e shr 1
      base *= base
```

```nim
proc parseNumber(my: var JsonParser): TokKind {.inline.} =
```
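The `e = e shr 1` / `base *= base` lines in the diff above are the inner loop of exponentiation by squaring. The small-table-plus-fallback scheme described in the PR text can be sketched like this (names like `pow10Tab`/`ipow10` are hypothetical, not the PR's exact code):

```nim
# Sketch of the table + fallback scheme: a small table covers the common
# exponent range; out-of-range exponents fall back to binary exponentiation.
const pow10Tab = block:
  # 10^0 .. 10^22 happen to be exactly representable in float64.
  var t: array[23, float]
  var x = 1.0
  for i in 0 .. 22:
    t[i] = x
    x *= 10.0
  t

proc ipow10(e: int): float =
  ## 10.0^e for any int e: table lookup when possible, squaring otherwise.
  var e = e
  var inv = false
  if e < 0:
    inv = true
    e = -e
  if e < pow10Tab.len:
    result = pow10Tab[e]           # the cheap, common path
  else:
    result = 1.0
    var base = 10.0
    while e > 0:                   # exponentiation by squaring
      if (e and 1) == 1: result *= base
      e = e shr 1
      base *= base                 # square the base each round, as above
  if inv: result = 1.0 / result
```

The squaring fallback is O(log e) multiplies, so even the full IEEE-754 double exponent range costs only about 10 iterations.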
Isn't the proc kinda too big to be `{.inline.}`? 🤔
No. In the by far most common case it is just one line (two almost-always-perfectly-predicted compare-jumps and one array reference). It is only used once, for human mental abstraction. So, there is no duplicated generated code.

Also, people are too sparing of `{.inline.}` in my view. Function call overhead is like 12..15 cycles * 3..6-way superscalar =~ 36..90 dynamic instruction slots. So, if duplication is not a concern, that is the rough amount of work to compare against.
If you were really going to browbeat me over inline then you should be complaining about `parseNumber` being inline, not `pow10`. ;-) {EDIT: oh, wait, that's what you were doing! Sorry!} But many numbers to parse might be just single digits, which take only a few cycles. In fact, the more numbers you have, the more likely many are single digits. So, the function call overhead would then be quite large compared to the work done.
Another thing inline buys you here is putting `parseNumber`'s C code in the same translation unit as the parsing itself. So, the C compiler's metrics/heuristics for inlining can at least be applied. Otherwise you rely on LTO, which a lot of people do not enable. Given that this whole bit of work was to get really fast speed, it seems appropriate for that reason as well as the "many single digit" cases. If you do PGO builds, the compiler will actually try to measure whether inlining is good or not. According to the GCC guys, that is the main benefit of PGO. I am always getting 1.5x to 2.0x speed-ups from PGO with Nim, which suggests to me that our `{.inline.}` choices are not generally so great (or at least a few critical choices are not...).
Oh, and `parseNumber` is also only called in one place, in `getTok`. So, still no extra L1 I-cache-style resource hit, and I also added `{.inline.}` on `getTok` to get the meat of all this logic into the translation unit of whatever is using the json parser.
Also, Nim's `{.inline.}` just marks it in the generated C as `inline`. So, it's still up to the C/C++ compiler's code metrics what ultimately happens. Those surely do take into account the statically observable number of expansions/substitutions and their sizes. So, the many calls to `getTok` from `json` are not a real problem. But if anyone really wants me to remove them, I can. I always use (and recommend using) PGO when performance matters, myself.
This is the only remaining problem. Before merging we should make overflow detection more precise than just a digit count for both `tkInt` and `tkFloat`.
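For the `tkInt` side, one exact alternative to a digit-count heuristic is the classic `strtol`-style cutoff test: check before each accumulation step whether it could overflow. A sketch with a hypothetical `accumulate` helper (not the PR's implementation):

```nim
# Exact int64 overflow detection during digit accumulation: before each
# `v = v*10 + d`, reject if that step would exceed high(int64).
proc accumulate(digits: string): tuple[ok: bool, v: int64] =
  const cutoff = high(int64) div 10        # largest v safe to multiply by 10
  const cutlim = int(high(int64) mod 10)   # last-digit limit when v == cutoff
  var v: int64 = 0
  for c in digits:
    let d = ord(c) - ord('0')
    if v > cutoff or (v == cutoff and d > cutlim):
      return (false, 0'i64)                # the next step would overflow
    v = v * 10 + int64(d)
  result = (true, v)
```

This rejects exactly the strings above `high(int64)` instead of everything past a fixed digit count, so 19-digit values that still fit are accepted.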
@Araq WARNING: Not ready to merge, though it passes all checks. (See final commit comment.)
Question probably for @Araq and all those in the prior discussion of this: Integer overflow going from JSON to a Nim `int64` … The precision dimension seems tricky, though. I know that sometimes people format floats to 18..20 digits just to try to get better round trips through 80-bit FP formats. I don't know how frequently other prog.lang JSON libs do this, though. Does anyone know? I avoid JSON due to it being atrociously 30-200x slower than binary formats and consequently don't have a lot of experience with the JSON libs. If they never emit more than 16 digits then there is no problem. If JSON libs frequently/ever emit an extra digit or 3, then it seems a bit overboard to have 17..20 decimal digits alone launch things into arbitrary-precision float arithmetic on the Nim receiver. I guess someone could hook into a preprocessor that truncated/rounded extra digits, but blech.

Ultimately, if one doesn't really know how the ASCII was generated, then client code needs to provide guidance for this "auto promotion to JString/arb.prec float" idea. So, it seems like reason enough for yet another new parameter … So, it seemed like asking how people would like this handled was warranted. With the below idea a …

Also, it seems like one thing that never came up in that overflow discussion was the idea of storing the … This would mean client code expecting ints in the range 2^63..2^64-1 needs to check a …

Now, I know your first reaction is probably that converting 2^63 into -2^63 seems like a keeerazy footgun. However, if I understand correctly, before, an exception was just raised for this data. So, client code had to try/except/do something special anyway to not fail. This would just change what it needs to do to not fail. Another new bool flag could also control whether you get the old exception or the new cast behavior. Best of all worlds?

I will probably hold off making these overflow conditions more precise until Monday/until I hear back from various folks on these questions.
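The "converting 2^63 into -2^63" behavior discussed above is just Nim's `cast` semantics for same-size integer types, and the client-side recovery is lossless. A tiny self-contained demo (the `stored` variable stands in for a hypothetical parser field):

```nim
# Demo of the "store the uint64 bit pattern in an int64 field" idea:
# values in 2^63 .. 2^64-1 come out negative as int64, and client code
# recovers the original with a cast back.
let big = 9223372036854775808'u64     # 2^63, too large for int64
let stored = cast[int64](big)         # what the parser field would hold
doAssert stored == low(int64)         # 2^63's bit pattern reads as -2^63
doAssert cast[uint64](stored) == big  # client-side recovery is lossless
```

Since `cast` only reinterprets bits, the round trip is exact for every value in 0 .. 2^64-1; the only cost is that the client must know when to apply the unsigned view.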
Alternatively, I am also fine if you want to merge this & let the imprecise test stand momentarily while someone else works on said overflow checking and/or API issues.
The …

Also, if "ever storing" the …

Well, if I get no more feedback on the future direction here, I will just do something tomorrow. I'm a little surprised, as the prior discussion seemed pretty passionate.
lib/pure/parsejson.nim (Outdated)
```nim
  finally:
    p.close()

iterator jsonTokens*(path: string, rawStringLiterals = false,
```
These API additions are not good for two reasons:
- An iterator that always produces the same value is a foreign concept.
- It looks like they are unused. :-)
They're unused because the idea was to provide something like `json.parseJsonFragments` but in the raw-parsing `parsejson` context. I think they are also not normal in yielding a `ptr JsonParser`, but that actually made like a 1.5x performance difference with the new faster logic, because the `JsonParser` object is big enough that the alloc/copies showed up at the top of the profile. It mostly seemed the easiest way to really streamline the usage syntax, as in here. I'm open to alternative suggestions to streamline said syntax, or just removing them and leaving them to the user, though.
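The copy-avoidance point can be illustrated in miniature with a hypothetical large object (the types here are stand-ins, not parsejson's real API): an inline iterator that yields a `ptr` to one reused buffer never copies the payload, whereas yielding the object by value would.

```nim
# Illustration of yielding `ptr` to one big reused object instead of
# copying it on every yield.
type Big = object
  payload: array[256, int]   # stand-in for a large JsonParser-sized object
  n: int

iterator bigItems(count: int): ptr Big =
  var b: Big                 # single buffer reused across iterations
  for i in 0 ..< count:
    b.n = i                  # "parse" the next token into the same buffer
    yield addr b             # no 2 KiB copy per yielded item

var total = 0
for p in bigItems(5):
  total += p.n               # caller must use it before the next iteration
doAssert total == 10
```

The trade-off is exactly the unusual ergonomics mentioned above: the pointer is only valid until the next iteration, so callers who want to keep a value must copy it themselves.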
I was more concerned about your feedback on the `int64` but not `uint64` overflow ideas: going down the `options = set[enum]` route and adding a new `u: uint64` field for users to conditionally/unconditionally access when they maybe/definitely expect unsigned, and that sort of direction. That's much more to do/get right than, say, moving those iterators to example code/doing something else. So, if you hated the general direction I wanted to hold off on that work.
The return type could be `lent T` but I don't know if today's compiler accepts that (it really shouldn't without a spec addition...). For now please remove these iterators; it's easy enough to write an open/close/loop snippet.
I was more concerned about your feedback about int64 but not uint64 overflow ideas, going down the options = set[enum] route and adding a new u: uint64 field for users to conditionally/unconditionally access when they maybe/definitely expect unsigned, and that sort of direction.
I would use a single `int64` field and then accessor procs/templates that hide the `cast[uint64](x)`. And `options = set[enum]` is the better design, yes. I couldn't use it because of backwards-compatibility reasons. There is a different solution that deals with the compat problems, but it's annoying to do; maybe it's time to make `.since` a builtin that works better.
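The single-field-plus-accessor suggestion might look something like this (type and accessor names are hypothetical, chosen for illustration):

```nim
# One stored int64 field; the unsigned view is an accessor that hides the
# cast, per the suggestion above.
type NumTok = object
  i: int64                       # the single storage field

template u(t: NumTok): uint64 = cast[uint64](t.i)

# A value in 2^63 .. 2^64-1 stored via its bit pattern:
var t = NumTok(i: cast[int64](18446744073709551615'u64))  # 2^64 - 1
doAssert t.i == -1                          # signed view wraps around
doAssert t.u == 18446744073709551615'u64    # accessor recovers the value
```

This keeps the object layout unchanged while letting callers who expect unsigned values opt into the reinterpretation explicitly.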
Can this be disabled, though? In the case of packedjson there is no reason to parse the number; it's stored as a string.
Yes. We can add a path based on … At the moment, though, I am working on the trickier problem of detecting arbitrary-precision float overflow/underflow. :-) (The integer case is easy. glibc actually uses a full bignum library to parse floats in `strtod`.)
Do not merge this. I am working on a cleaner and more general approach in light of edge cases.
This uses @c-blake's code from nim-lang/Nim#16055 to avoid having to copy the string which we parse and instead directly parses a number into either an int or a string. Note that this implementation has some edge cases that break it. In particular a number that starts valid, but doesn't end as a valid number (which is technically ok, but will break our CSV parser, since our position in the memfile will *not* be on the next separator!).
Related issues: #12833 #3809 #12152 nim-lang/RFCs#188